System Design: Top-K YouTube Videos

A comprehensive guide to designing a real-time system for tracking and displaying the most viewed YouTube videos using stream processing and probabilistic data structures.

1. Problem Understanding & Requirements

Functional Requirements

  • Track view counts for billions of videos in real-time
  • Return top-K (e.g., top 100) most viewed videos globally
  • Support time-windowed queries (last hour, day, week)
  • Handle video metadata updates
  • Support regional/category-specific top-K queries

Non-Functional Requirements

  • High Throughput: Process millions of views per second
  • Low Latency: Real-time processing with <1 second lag
  • Scalability: Horizontal scaling for growing traffic
  • Approximate Accuracy: 90-95% accuracy acceptable (trade-off for performance)
  • Availability: 99.99% uptime
  • Cost Efficiency: Optimize memory and compute resources

Scale Estimation

Daily Active Users:    1 billion
Videos on Platform:    2 billion
Daily Video Views:     5 billion
Average Views/Second:  ~58,000
Peak Views/Second:     ~200,000
View Event Size:       ~500 bytes
Daily Data Volume:     ~2.5 TB
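
These figures follow from simple back-of-envelope arithmetic (the ~3.5x peak factor is an assumption; the rest comes from the table above):

daily_views = 5_000_000_000          # 5 billion views/day
event_size_bytes = 500               # ~500 bytes per view event

avg_views_per_sec = daily_views / 86_400        # seconds in a day -> ~57,870
peak_views_per_sec = avg_views_per_sec * 3.5    # assumed peak factor -> ~202,500
daily_volume_tb = daily_views * event_size_bytes / 1e12   # -> 2.5 TB

print(f"avg ~{avg_views_per_sec:,.0f}/s, peak ~{peak_views_per_sec:,.0f}/s, ~{daily_volume_tb:.1f} TB/day")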

2. High-Level Architecture

3. Core Components Deep Dive

3.1 Data Model

View Event Schema:

{
  "event_id": "uuid",
  "video_id": "string",
  "user_id": "string",
  "timestamp": "long",
  "region": "string",
  "category": "string",
  "watch_duration": "int",
  "device_type": "string"
}

3.2 Stream Processing Pipeline

Stream Processing Flow:

  1. Validation Stage: Filter duplicates, validate schema, check fraud
  2. Counting Stage: Update Count-Min Sketch for approximate counts
  3. Aggregation Stage: Maintain heap-based Top-K
  4. Windowing Stage: Time-based aggregations (tumbling/sliding windows)

4. Count-Min Sketch Implementation

What is Count-Min Sketch?

A probabilistic data structure that estimates the frequency of events in a stream using sub-linear space.

Key Properties:

  • Space: O(w × d) where w = width, d = depth
  • Update Time: O(d)
  • Query Time: O(d)
  • Overestimates, never underestimates
  • Error bound: overestimate of at most εN with probability at least 1 − δ (N = total count)

Count-Min Sketch Algorithm

Parameters:

  • ε (epsilon): Error tolerance (e.g., 0.01 = 1% error)
  • δ (delta): Failure probability (e.g., 0.01 = 99% confidence)
  • Width (w): ⌈e/ε⌉ ≈ 272 for ε = 0.01
  • Depth (d): ⌈ln(1/δ)⌉ ≈ 5 for δ = 0.01

Update Operation:

For each hash function h_i (i = 1 to d):
    index = h_i(video_id) mod w
    counter[i][index] += 1

Query Operation:

estimated_count = MIN(counter[i][h_i(video_id) mod w]) for all i
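
These two operations translate directly into code. A minimal single-process Python sketch (the blake2b-based hashing is an implementation choice, not part of the design):

import hashlib
import math


class CountMinSketch:
    """d rows of w counters; each row uses its own hash function."""

    def __init__(self, epsilon=0.01, delta=0.01):
        self.width = math.ceil(math.e / epsilon)      # w = ceil(e / epsilon)
        self.depth = math.ceil(math.log(1 / delta))   # d = ceil(ln(1 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _index(self, row, key):
        # Salt the key with the row id to derive an independent hash per row
        digest = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def update(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def query(self, key):
        # Minimum across rows: overestimates, never underestimates
        return min(self.table[row][self._index(row, key)] for row in range(self.depth))


cms = CountMinSketch(epsilon=0.01, delta=0.01)
cms.update("video_42", 3)
print(cms.query("video_42"))   # >= 3; exactly 3 while collisions are rare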

Why Count-Min Sketch for YouTube?

Memory Comparison:

Approach                     Memory Required
Exact counters (2B videos)   16 GB (8-byte counter × 2B videos)
Count-Min Sketch             ~10 MB (e.g., 250K width × 5 depth × 8-byte counters)

Space Savings: ~1,600x reduction. (The example parameters above, w = 272 and d = 5, fit in about 11 KB; at this scale the sketch is made much wider so its absolute error stays small relative to top-video counts, and it still fits in roughly 10 MB.)
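
The arithmetic behind the comparison (the 250K-wide sketch is an illustrative sizing, not a fixed parameter of the design):

videos = 2_000_000_000
exact_bytes = videos * 8          # one 8-byte counter per video -> 16 GB (keys excluded)

width, depth = 250_000, 5         # sketch sized for small absolute error at this scale
cms_bytes = width * depth * 8     # -> 10 MB

print(f"exact: {exact_bytes / 1e9:.0f} GB, sketch: {cms_bytes / 1e6:.0f} MB, "
      f"savings: ~{exact_bytes // cms_bytes:,}x")   # -> ~1,600x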

5. Top-K Algorithm Design

Heap-Based Approach

Algorithm Steps:

  1. Initialize: Create min-heap of size K
  2. For each view event:
    • Update Count-Min Sketch
    • Get estimated count from CMS
    • If count > heap minimum:
      • Remove minimum from heap
      • Insert (video_id, count) into heap
  3. Periodically sync heap to Redis (every 1-5 seconds)
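
A minimal single-worker sketch of steps 1-3 (it reuses the CountMinSketch class sketched in Section 4; the periodic Redis flush of step 3 is sketched in Section 7):

import heapq


class TopKTracker:
    """Pairs a Count-Min Sketch with a small candidate set to track the Top-K."""

    def __init__(self, k, sketch):
        self.k = k
        self.sketch = sketch       # e.g., CountMinSketch from Section 4
        self.candidates = {}       # video_id -> latest estimated count

    def process(self, video_id):
        self.sketch.update(video_id)
        estimate = self.sketch.query(video_id)

        if video_id in self.candidates or len(self.candidates) < self.k:
            self.candidates[video_id] = estimate
            return

        # Evict the current minimum only if the new estimate beats it.
        # (K is small, so a linear scan stands in for an indexed min-heap.)
        min_video = min(self.candidates, key=self.candidates.get)
        if estimate > self.candidates[min_video]:
            del self.candidates[min_video]
            self.candidates[video_id] = estimate

    def top_k(self):
        return heapq.nlargest(self.k, self.candidates.items(), key=lambda kv: kv[1])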

Space-Saving Algorithm (Alternative)

Maintains K counters for heavy hitters with guaranteed error bounds:

  • Space: O(K)
  • Better accuracy than pure Count-Min Sketch
  • Combines frequency estimation with Top-K
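
The counter-replacement rule at the heart of Space-Saving is small enough to sketch directly (minimal Python; for a stream of N events, any reported count overestimates the true count by at most N/K):

class SpaceSaving:
    """Keep at most k counters; an evicted item's successor inherits its count."""

    def __init__(self, k):
        self.k = k
        self.counters = {}   # item -> counter value

    def update(self, item):
        if item in self.counters:
            self.counters[item] += 1
        elif len(self.counters) < self.k:
            self.counters[item] = 1
        else:
            # Replace the minimum counter; the newcomer inherits min + 1
            min_item = min(self.counters, key=self.counters.get)
            self.counters[item] = self.counters.pop(min_item) + 1

    def top_k(self):
        return sorted(self.counters.items(), key=lambda kv: kv[1], reverse=True)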

6. Time Window Processing

Window Types

Multi-Window Architecture

Implementation with Apache Flink:

DataStream<ViewEvent> views = ...;

// 1-hour tumbling window
DataStream<TopK> hourlyTopK = views
    .keyBy(ViewEvent::getVideoId)
    .window(TumblingEventTimeWindows.of(Time.hours(1)))
    .aggregate(new CountMinSketchAggregator())
    .process(new TopKFunction(100));

// 24-hour sliding window (slide every hour)
DataStream<TopK> dailyTopK = views
    .keyBy(ViewEvent::getVideoId)
    .window(SlidingEventTimeWindows.of(Time.hours(24), Time.hours(1)))
    .aggregate(new CountMinSketchAggregator())
    .process(new TopKFunction(100));

7. Database Design

Redis Schema (Hot Data - Top-K Cache)

Key Pattern: topk:{window}:{region}:{category}
Type: Sorted Set (ZSET)

Example:
topk:1h:global:all    → [(video_1, 1M), (video_2, 900K), ...]
topk:24h:US:music     → [(video_5, 5M), (video_9, 4.8M), ...]
topk:7d:global:gaming → [(video_12, 50M), ...]

Commands:
- ZADD topk:1h:global:all 1000000 video_1
- ZREVRANGE topk:1h:global:all 0 99 WITHSCORES # Get Top-100
- ZINCRBY topk:1h:global:all 1 video_1 # Increment
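
Step 3 of the Top-K algorithm ("periodically sync heap to Redis") maps onto a single pipelined write per interval. A sketch using redis-py (host, port, and TTL are assumptions):

import redis

r = redis.Redis(host="redis-host", port=6379)


def sync_topk(key, topk, ttl_seconds=3600):
    """Replace the sorted set at `key` with the latest Top-K snapshot."""
    pipe = r.pipeline()
    pipe.delete(key)
    pipe.zadd(key, {video_id: count for video_id, count in topk})
    pipe.expire(key, ttl_seconds)
    pipe.execute()


def read_topk(key, k=100):
    # Equivalent to ZREVRANGE key 0 k-1 WITHSCORES
    return r.zrevrange(key, 0, k - 1, withscores=True)


sync_topk("topk:1h:global:all", [("video_1", 1_000_000), ("video_2", 900_000)])
print(read_topk("topk:1h:global:all", k=2))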

Cassandra Schema (Cold Data - Historical)

-- Custom type used in top_k_snapshots (must be created before the table that references it)
CREATE TYPE video_rank (
    video_id text,
    view_count bigint,
    rank int
);

CREATE TABLE video_view_counts (
    video_id text,
    time_bucket timestamp,      -- Hourly buckets
    window_type text,           -- '1h', '24h', '7d'
    region text,
    category text,
    view_count bigint,
    PRIMARY KEY ((video_id, window_type), time_bucket, region, category)
) WITH CLUSTERING ORDER BY (time_bucket DESC);

CREATE TABLE top_k_snapshots (
    window_type text,
    time_bucket timestamp,
    region text,
    category text,
    ranking list<frozen<video_rank>>,
    PRIMARY KEY ((window_type, region, category), time_bucket)
) WITH CLUSTERING ORDER BY (time_bucket DESC);
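
Reading the hourly buckets back with the Python cassandra-driver (contact point and keyspace name are assumptions, not part of the schema):

from cassandra.cluster import Cluster

cluster = Cluster(["cassandra-host"])            # assumed contact point
session = cluster.connect("video_analytics")     # assumed keyspace

rows = session.execute(
    """
    SELECT time_bucket, region, category, view_count
    FROM video_view_counts
    WHERE video_id = %s AND window_type = %s
    LIMIT 24
    """,
    ("abc123", "1h"),
)
for row in rows:
    print(row.time_bucket, row.region, row.category, row.view_count)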

8. System Architecture Patterns

8.1 Lambda Architecture Pattern

Benefits:

  • Speed layer: Low latency, approximate results
  • Batch layer: High accuracy, eventual consistency
  • Serving layer: Merges both for best of both worlds

8.2 Kappa Architecture Pattern (Simplified)

Benefits:

  • Simpler architecture
  • Single processing pipeline
  • Reprocess by replaying Kafka

9. Optimizations & Trade-offs

9.1 Partitioning Strategy

Partitioning by video_id ensures:

  • All views for same video go to same partition
  • Enables local Count-Min Sketch per partition
  • Parallel processing across partitions
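
A sketch of the routing this implies (the hash is illustrative; Kafka's default key-based partitioner has the same effect when events are keyed by video_id):

import hashlib


def partition_for(video_id, num_partitions):
    """Stable hash routing: every view of the same video lands on the same partition."""
    digest = hashlib.md5(video_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# All views of "abc123" hit one partition, so its Count-Min Sketch and
# heap updates stay local to a single worker.
print(partition_for("abc123", 64))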

9.2 Hierarchical Aggregation

Benefits:

  • Reduces network traffic
  • Pre-aggregation at edge
  • Faster global Top-K computation

9.3 Sampling Techniques

For extremely high traffic videos, use sampling:

If view_count > threshold:
    sample_rate = base_rate / (view_count / threshold)
    if random() < sample_rate:
        process_event()
        weight = 1 / sample_rate

10. Handling Edge Cases

10.1 Viral Video Spike

Detection:

  • Monitor rate of change in view counts
  • Alert when growth > 500% in 15 minutes
  • Auto-scale processing capacity

10.2 Bot Detection & Fraud Prevention

Fraud Detection Rules:

  • IP-based rate limiting (max N views/minute)
  • User session validation
  • Device fingerprinting
  • ML-based bot detection

10.3 Late Arriving Events

Watermark Strategy:

Watermark = Max_Event_Time - Allowed_Lateness
Allowed_Lateness = 15 minutes
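
A minimal sketch of this rule (in the Flink pipeline from Section 6 it would be expressed via watermark strategies and allowed lateness; the standalone version below only shows the bookkeeping):

class LatenessTracker:
    ALLOWED_LATENESS_MS = 15 * 60 * 1000   # 15 minutes

    def __init__(self):
        self.max_event_time_ms = 0

    def observe(self, event_time_ms):
        """Advance the watermark and report whether the event arrived late."""
        self.max_event_time_ms = max(self.max_event_time_ms, event_time_ms)
        watermark = self.max_event_time_ms - self.ALLOWED_LATENESS_MS
        return event_time_ms < watermark   # late events trigger window corrections


tracker = LatenessTracker()
tracker.observe(1_700_000_000_000)                          # advances the watermark
print(tracker.observe(1_700_000_000_000 - 20 * 60 * 1000))  # 20 min older -> True (late)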

11. Monitoring & Observability

Key Metrics

Alerting Rules:

  • Processing lag > 10 seconds
  • Error rate > 0.1%
  • Redis hit rate < 95%
  • Query latency p99 > 100ms

12. API Design

REST API Endpoints

GET /api/v1/videos/top
Query Parameters:
- k: number (default: 100)
- window: enum (1h, 24h, 7d, 30d)
- region: string (default: global)
- category: string (default: all)

Response:
{
  "window": "24h",
  "region": "US",
  "category": "music",
  "timestamp": "2025-10-21T10:00:00Z",
  "videos": [
    {
      "rank": 1,
      "video_id": "abc123",
      "title": "Viral Song",
      "view_count": 5000000,
      "thumbnail_url": "...",
      "channel": "Popular Artist"
    },
    ...
  ]
}
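
For illustration, calling the endpoint from Python (the host is a placeholder; parameters follow the spec above):

import requests

resp = requests.get(
    "https://api.example.com/api/v1/videos/top",
    params={"k": 10, "window": "24h", "region": "US", "category": "music"},
    timeout=5,
)
resp.raise_for_status()
for video in resp.json()["videos"]:
    print(video["rank"], video["video_id"], video["view_count"])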

WebSocket for Real-time Updates

ws://api.youtube.com/v1/videos/top/stream
Message Format:
{
  "type": "ranking_update",
  "window": "1h",
  "changes": [
    {"video_id": "xyz789", "old_rank": 5, "new_rank": 3},
    {"video_id": "abc123", "old_rank": null, "new_rank": 100}
  ],
  "timestamp": "2025-10-21T10:30:15Z"
}

13. Cost Optimization

Resource Estimates

Component       Specification                    Monthly Cost
Kafka Cluster   20 brokers (m5.2xlarge)          $8,000
Flink Cluster   50 task managers (c5.4xlarge)    $25,000
Redis Cluster   10 nodes (r6g.2xlarge)           $5,000
Cassandra       30 nodes (i3.2xlarge)            $15,000
Network         100 TB egress                    $9,000
Total                                            ~$62,000/month

Cost Optimization Strategies

  1. Use Count-Min Sketch instead of exact counters (1,600x memory savings)
  2. Tiered storage: Redis (hot) → Cassandra (warm) → S3 (cold)
  3. Compression: Enable Kafka message compression (60% reduction)
  4. Reserved instances: Save 40% on compute costs
  5. Data retention: Keep only 30 days in Cassandra, archive to S3

14. Generic Top-K System Design Framework

Universal Components for Any Top-K Problem

Every Top-K system (trending tweets, popular products, hot searches, etc.) shares common patterns. Here's a reusable framework:

14.1 Problem Categories & Patterns

Category 1: Heavy Hitters (Most Frequent Items)

Examples: Top products, trending hashtags, popular searches

Decision Matrix:

Cardinality   Memory Budget   Latency   Algorithm Choice
< 100K        High            Low       HashMap + Min-Heap
100K-10M      Medium          Low       Count-Min Sketch + Heap
> 10M         Low             Medium    Space-Saving + Sampling
Any           Very Low        Any       Lossy Counting

Category 2: Trending Items (Time-Decayed)

Examples: Trending topics, viral content, breaking news

Key Challenge: Balance recency vs popularity

Exponential Decay Formula:

score(t) = count * e^(-λ * (current_time - event_time))
λ = decay_rate (e.g., 0.0001 for hourly decay)
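
A small sketch of this score (the decay rate and ages below are illustrative):

import math
import time


def trending_score(count, event_time, decay_rate=0.0001, now=None):
    """score = count * e^(-lambda * age_seconds): recency discounts raw totals."""
    age = (now if now is not None else time.time()) - event_time
    return count * math.exp(-decay_rate * age)


now = time.time()
day_old = trending_score(1_000_000, now - 24 * 3600, now=now)   # ~177 after decay
fresh = trending_score(200_000, now - 600, now=now)             # ~188,000
print(f"{day_old:.0f} vs {fresh:.0f}")   # the fresher, smaller item ranks higher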

Category 3: Multi-Dimensional Top-K

Examples: Top products by category, top videos by region

Storage Example:

# Single dimension
topk:global → [item1, item2, ...]

# Multiple dimensions
topk:us:electronics → [item3, item5, ...]
topk:uk:books → [item7, item9, ...]

# With aggregation
topk:global → aggregate(all regions, all categories)

14.2 Universal Design Checklist

Phase 1: Requirements Analysis

✓ Define "Top-K" clearly
  □ By count/frequency?
  □ By score (weighted)?
  □ By revenue/value?
  □ By engagement (clicks, time)?

✓ Cardinality estimation
  □ Total unique items: ___
  □ Active items per window: ___
  □ Expected growth rate: ___

✓ Query patterns
  □ Global Top-K
  □ Regional/Segmented Top-K
  □ Time-windowed Top-K
  □ Real-time updates needed?

✓ Accuracy requirements
  □ Exact required? (±0%)
  □ High accuracy? (±1-2%)
  □ Approximate OK? (±5-10%)

✓ Latency requirements
  □ Real-time (<1s)
  □ Near real-time (1-10s)
  □ Batch (minutes/hours)

Phase 2: Algorithm Selection

Phase 3: Data Structure Selection

Comparison Table:

Data Structure     Space     Update     Query     Use Case
HashMap + Heap     O(N)      O(log K)   O(1)      Small N, exact counts
Count-Min Sketch   O(w*d)    O(d)       O(d)      Large N, approximate
Count Sketch       O(w*d)    O(d)       O(d)      Better accuracy than CMS
Space-Saving       O(K)      O(1)       O(K)      Memory-critical, Top-K only
Lossy Counting     O(1/ε)    O(1)       O(1/ε)    Bounded error, simple
HyperLogLog        O(m)      O(1)       O(1)      Cardinality only
Bloom Filter       O(m)      O(k)       O(k)      Membership, not counting

14.3 Common Patterns & Anti-Patterns

✅ Best Practices

1. Layered Aggregation

Benefits:

  • Reduces network traffic by 10-100x
  • Enables parallel processing
  • Natural sharding strategy

2. Hybrid Counting Strategy

If item_count < threshold:
    Use exact counting (HashMap)
Else:
    Use Count-Min Sketch

3. Progressive Accuracy

Top 10:    Exact (99.9% accuracy)
Top 11-50: High precision (98% accuracy)
Top 51-100: Good enough (95% accuracy)
Top 100+: Approximate (90% accuracy)

4. Time-Based Partitioning

Partition by hour/day:
topk:2025-10-21-10 → Hour window
topk:2025-10-21    → Day window
topk:2025-10       → Month window

Enables:
- Parallel processing
- Easy TTL/expiration
- Historical queries

โŒ Anti-Patterns to Avoidโ€‹

1. The "Everything Exact" Trap

โŒ Storing exact counts for billions of items
โœ… Use probabilistic for tail, exact for head

2. The "Single Point of Aggregation"

โŒ One server aggregates all events
โœ… Hierarchical aggregation across multiple layers

3. The "No Time Windows"

โŒ Only all-time Top-K
โœ… Multiple windows (1h, 24h, 7d, 30d, all-time)

4. The "Synchronous Updates"

โŒ Update Top-K on every single event
โœ… Batch updates or micro-batches (100-1000 events)

5. The "Ignoring Cold Start"

โŒ Empty Top-K at system startup
โœ… Bootstrap from historical data or defaults

14.4 Scalability Patterns

Horizontal Scaling Strategy

Merge Algorithm for Distributed Top-K:

import heapq
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    id: str
    count: int


def merge_topk_lists(partition_topks: List[List[Item]], k: int) -> List[Item]:
    """
    Merge Top-K lists from multiple partitions (each sorted by count, descending).
    Time: O((K + P) * log P) where P = number of partitions.
    """
    min_heap = []

    # Initialize with the first (largest) item from each partition
    for partition_id, items in enumerate(partition_topks):
        if items:
            heapq.heappush(min_heap, (-items[0].count, partition_id, 0))

    result = []
    while min_heap and len(result) < k:
        neg_count, partition_id, idx = heapq.heappop(min_heap)
        items = partition_topks[partition_id]
        result.append(items[idx])

        # Add the next item from the same partition
        if idx + 1 < len(items):
            heapq.heappush(min_heap, (-items[idx + 1].count, partition_id, idx + 1))

    return result

Vertical Scaling Strategy

14.5 Testing & Validation

Correctness Testing

Accuracy Validation Script:

# scipy's kendalltau stands in for the original calculate_kendall_tau helper
from scipy.stats import kendalltau


def validate_topk_accuracy(exact_counts, approximate_topk, k):
    """
    Compare an approximate Top-K (ranked list of (item, approx_count) pairs)
    against exact ground-truth counts (dict of item -> true count).
    """
    # Ground-truth Top-K from the exact counts
    ground_truth = sorted(exact_counts.items(),
                          key=lambda x: x[1],
                          reverse=True)[:k]

    # Precision: fraction of reported items that belong in the true Top-K
    true_items = {item for item, _ in ground_truth}
    approx_items = [item for item, _ in approximate_topk]
    precision = len(set(approx_items) & true_items) / k

    # Rank correlation (Kendall's tau) over items present in both rankings
    true_rank = {item: r for r, (item, _) in enumerate(ground_truth)}
    approx_rank = {item: r for r, item in enumerate(approx_items)}
    shared = [item for item in approx_items if item in true_rank]
    rank_correlation = (kendalltau([true_rank[i] for i in shared],
                                   [approx_rank[i] for i in shared]).correlation
                        if len(shared) >= 2 else float("nan"))

    # Average absolute count error across the reported Top-K
    avg_count_error = sum(abs(exact_counts.get(item, 0) - approx_count)
                          for item, approx_count in approximate_topk) / k

    return {
        'precision': precision,
        'rank_correlation': rank_correlation,
        'avg_count_error': avg_count_error
    }

14.6 Monitoring Template

Universal Metrics Dashboard:

Alert Thresholds (Generic):

alerts:
  critical:
    - processing_lag > 60s
    - error_rate > 1%
    - query_latency_p99 > 1s
    - cache_hit_rate < 90%

  warning:
    - processing_lag > 30s
    - error_rate > 0.1%
    - memory_usage > 80%
    - topk_accuracy < 95%

14.7 Configuration Template

Reusable Configuration Pattern:

top_k_config:
  # Core parameters
  k: 100                        # Number of top items
  update_interval_ms: 1000      # How often to recompute Top-K

  # Algorithm selection
  algorithm:
    small_scale: "exact_heap"          # < 100K items
    medium_scale: "count_min_sketch"   # 100K-10M items
    large_scale: "space_saving"        # > 10M items

  # Count-Min Sketch parameters
  cms:
    width: 2719                 # w = ceil(e / epsilon)
    depth: 7                    # d = ceil(ln(1 / delta))
    epsilon: 0.001              # Error rate: 0.1%
    delta: 0.001                # Failure probability (99.9% confidence)

  # Time windows
  windows:
    - name: "realtime"
      duration: "5m"
      slide: "1m"
    - name: "hourly"
      duration: "1h"
      slide: "5m"
    - name: "daily"
      duration: "24h"
      slide: "1h"
    - name: "weekly"
      duration: "7d"
      slide: "1d"

  # Storage tiers
  storage:
    hot:
      type: "redis"
      ttl: "1h"
      max_size: "1GB"
    warm:
      type: "cassandra"
      ttl: "30d"
    cold:
      type: "s3"
      ttl: "365d"

  # Performance tuning
  performance:
    batch_size: 1000            # Events per batch
    parallelism: 16             # Parallel workers
    buffer_size: 10000          # Event buffer
    checkpoint_interval: "60s"  # State checkpoint

  # Scaling thresholds
  auto_scale:
    scale_up_threshold: 0.8     # CPU/Memory %
    scale_down_threshold: 0.3
    min_instances: 2
    max_instances: 20

14.8 Migration Strategy

From Exact to Approximate:

14.9 Quick Reference: Problem → Solution

Problem Type        Key Challenge         Recommended Approach
Trending Hashtags   Time decay            Exponential decay + Count-Min Sketch
Popular Products    Multiple categories   Multi-dimensional Top-K + Redis
Viral Videos        Sudden spikes         Auto-scaling + Heavy hitter detection
Hot Searches        Short-lived trends    Sliding windows + Space-Saving
Top Sellers         Exact revenue         Exact counting + Sampling for tail
Frequent Buyers     User segmentation     Hierarchical Top-K per segment
Popular Articles    Time + engagement     Composite score + Min-Heap
Trending Topics     Multi-region          Distributed Top-K + Merge

15. Summary & Trade-offs

Design Decisions

Aspect              Choice              Alternative       Trade-off
Counting            Count-Min Sketch    Exact counters    Accuracy vs Memory
Top-K               Min-Heap + CMS      Space-Saving      Simplicity vs Accuracy
Stream Processing   Apache Flink        Spark Streaming   Latency vs Maturity
Cache               Redis Sorted Sets   Memcached         Features vs Speed
Storage             Cassandra           PostgreSQL        Scale vs Simplicity
Architecture        Kappa               Lambda            Simplicity vs Accuracy

Key Takeaways

  1. Probabilistic data structures (Count-Min Sketch) enable massive scale
  2. Stream processing provides real-time updates with low latency
  3. Time windows support different use cases (trending vs popular)
  4. Partitioning by video_id enables parallel processing
  5. Hierarchical aggregation reduces network traffic
  6. Multi-layer caching (Redis + Cassandra + S3) optimizes cost
  7. Accuracy trade-off (90-95%) acceptable for this use case

Scalability

The system can scale to:

  • ✅ 500,000+ views/second
  • ✅ 5+ billion videos
  • ✅ Sub-second query latency
  • ✅ 99.99% availability
  • ✅ Global distribution across regions

Total System Capacity:

  • Throughput: 500K events/sec
  • Latency: <1 second end-to-end
  • Accuracy: 90-95% (configurable via CMS parameters)
  • Storage: ~10 MB per Top-K (Count-Min Sketch)
  • Scale: Horizontally scalable to billions of videos